Learning to Walk with PPO¶
STUDENT: MANUELA LARREA GÓMEZ

In this assignment we will optimize a simulated robot so that it can walk across a terrain as efficiently as possible. To do so, we will use the Proximal Policy Optimization (PPO) algorithm.
Instructions¶
Throughout the notebook you will find cells that you must fill in with your own code. Follow the notebook's instructions and pay attention to the following icons:

You must solve the exercise by writing your own code or answer in the cell immediately below. The maximum grade obtainable with this kind of exercise is 7 out of 10.

This is a useful hint or observation that may help you solve the exercise. Pay attention to these hints to understand the exercise in greater depth.

This is an advanced exercise that can help you go deeper into the topic and earn a higher grade. By solving this kind of exercise you can obtain up to 3 additional points out of 10. Good luck!
Preliminaries¶
First we will install the libraries needed to run PPO, as well as those needed to run the simulation environment.
!apt install swig
!pip install box2d box2d-kengz
!pip install stable-baselines3[extra] gymnasium[box2d]
Reading package lists... Done
Building dependency tree... Done
0 upgraded, 2 newly installed, 0 to remove and 45 not upgraded.
Setting up swig (4.0.2-1ubuntu1) ...
Successfully built box2d box2d-kengz
Successfully installed box2d-2.3.2 box2d-kengz-2.3.3
Successfully built box2d-py AutoROM.accept-rom-license
Successfully installed AutoROM.accept-rom-license-0.6.1 ale-py-0.8.1 autorom-0.6.1 box2d-py-2.3.5 farama-notifications-0.0.4 gymnasium-0.29.1 nvidia-cublas-cu12-12.1.3.1 nvidia-cuda-cupti-cu12-12.1.105 nvidia-cuda-nvrtc-cu12-12.1.105 nvidia-cuda-runtime-cu12-12.1.105 nvidia-cudnn-cu12-8.9.2.26 nvidia-cufft-cu12-11.0.2.54 nvidia-curand-cu12-10.3.2.106 nvidia-cusolver-cu12-11.4.5.107 nvidia-cusparse-cu12-12.1.0.106 nvidia-nccl-cu12-2.20.5 nvidia-nvjitlink-cu12-12.5.40 nvidia-nvtx-cu12-12.1.105 shimmy-1.3.0 stable-baselines3-2.3.2 swig-4.2.1
We will also load the Colab extension that lets us display TensorBoard dashboards inside this notebook.
%load_ext tensorboard
We launch the TensorBoard interface to monitor the runs folder.
%tensorboard --logdir "runs"
Teaching a robot to walk¶
The environment we will work with is Bipedal Walker. In it, a simulated robot must traverse irregular terrain as efficiently as possible. As observations, the robot receives:
- Hull angle speed.
- Angular velocity.
- Horizontal velocity component.
- Vertical velocity component.
- Position of the joints.
- Angular speed of the joints.
- Leg contact with the ground.
- 10 distance measurements simulating a lidar.
The robot has 4 continuous actions, in the range [-1, 1], representing the motor speed applied at its 4 available joints: two at the robot's "hips" and two at its "knees".
Reward is obtained as the robot moves to the right. If the robot's hull touches the ground, it suffers a -100 point penalty and the episode ends. Applying torque to the motors incurs a small penalty, so a more efficient robot earns more reward for the same distance traveled. The episode ends without penalty if the robot reaches the end of the course.

Train an agent using PPO that achieves at least 250 points of reward on average over a 10-episode evaluation.
You must submit both this notebook with your solution and a video showing the agent running and completing the course.

Some tips for achieving this:
- You do not need a GPU to solve this assignment. The environment is simple enough to run quickly on CPU.
- Start with training runs of 100,000 timesteps, and try modifying different PPO and environment-construction parameters to see which ones yield the best reward at the end of training.
- The first parameter to tune should be the number of vectorized environments, since it lets you train faster. Keep in mind that PPO performs a learning step every time it collects n_steps * n_envs samples, so increasing the number of environments can slow learning down. To keep a good learning pace, it is recommended to adjust the n_steps parameter according to the number of environments, e.g. n_steps=2048//n_envs.
- The neural network architecture is another relevant parameter to tune. Try changing the number of layers and the number of neurons per layer, starting small and scaling up.
- PPO's entropy coefficient (ent_coef), which governs its amount of exploration, can also bring improvements if tuned properly. It is recommended to start with very small values (e.g. 1e-4) and increase in multiples of 10.
- Once all of these parameters are tuned, you can try increasing the number of training timesteps and see whether you reach the target reward in evaluation.
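The relationship between the number of environments and n_steps mentioned in the tips can be sketched with a few lines of arithmetic (a minimal illustration, not part of the required solution):

```python
# PPO performs one learning step after collecting n_steps * n_envs samples.
# Keeping that product constant preserves the rollout size as n_envs grows.
base_rollout = 2048  # rollout size with stable-baselines3's default n_steps and 1 env

for n_envs in (1, 4, 8, 16):
    n_steps = base_rollout // n_envs  # the adjustment suggested above
    print(f"n_envs={n_envs:2d}  n_steps={n_steps:4d}  samples/update={n_steps * n_envs}")
```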

This environment is considered fully solved if you are able to obtain 300 points of reward. Can you build an agent that achieves it?
# Library imports
import gymnasium as gym
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import VecVideoRecorder
from stable_baselines3.common.env_util import make_vec_env, DummyVecEnv
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.utils import set_random_seed
from moviepy.video.fx.all import speedx
from moviepy.editor import VideoFileClip
from pathlib import Path
from IPython.display import Image, display
from functools import partial
from stable_baselines3.common.logger import configure
from statistics import mean
import os
We start by fixing the random seed.
SEED = 42
set_random_seed(SEED)
# Define the function that generates the video
def record_video(env, model, prefix, steps=1000):
    """Generates a video of a reinforcement learning model interacting with an environment.

    Arguments:
        env: name of the environment (str) or callable generating the environment
        model: reinforcement learning model
        prefix: filename prefix for the recorded video
        steps: number of steps to render
    """
    video_folder = "videos/"
    if callable(env):
        envfun = partial(env, render_mode="rgb_array")
    else:
        envfun = lambda: gym.make(env, render_mode="rgb_array")
    vec_env = DummyVecEnv([envfun])
    obs = vec_env.reset()
    # Record the video starting at the first step
    vec_env = VecVideoRecorder(vec_env, video_folder,
                               record_video_trigger=lambda x: x == 0,
                               video_length=steps, name_prefix=prefix)
    vec_env.reset()
    for _ in range(steps + 1):
        action = [model.predict(obs[0])[0]]
        obs, _, _, _ = vec_env.step(action)
    # Save the video
    vec_env.close()
# Helper to convert the videos into GIFs for display inside the notebook
def mp4_to_gif(input_path, output_path, start_time=0, end_time=20, resize_factor=1):
    if not os.path.exists('gif'):
        os.makedirs('gif')
    clip = VideoFileClip(input_path)
    if end_time:
        clip = clip.subclip(start_time, end_time)
    else:
        clip = clip.subclip(start_time)
    clip = speedx(clip, factor=3)  # play back at 3x speed
    if resize_factor != 1:
        clip = clip.resize(resize_factor)
    output_path = "gif/" + output_path
    # Write the result to a GIF file
    clip.write_gif(output_path, fps=clip.fps)
    clip.close()
def display_gif(path):
    gifPath = Path(path)
    # Display GIF in Jupyter, Colab, IPython
    with open(gifPath, 'rb') as f:
        display(Image(data=f.read(), format='png'))
# Create the logs directory
log_dir = "./runs/"
os.makedirs(log_dir, exist_ok=True)
Model 1: The starting point¶
We start from a simple model with 4 environments and the remaining parameters at their defaults as described in the stable-baselines documentation. The model consists of:
- 4 environments
- 2048 // 4 steps
- 100000 timesteps
# Create the environment
n_envs = 4
vec_env = make_vec_env("BipedalWalker-v3", n_envs=n_envs, seed=SEED)
# Create the PPO model
model = PPO("MlpPolicy", vec_env, n_steps=2048 // n_envs, tensorboard_log=log_dir, seed=SEED)
# Train the model
model.learn(total_timesteps=100000, log_interval=10)
<stable_baselines3.ppo.ppo.PPO at 0x7de24267f4f0>
# Save the model
model.save(os.path.join(log_dir, "ppo_bipedalwalker1"))
# Evaluate
mean_reward, std_reward = evaluate_policy(model, vec_env, n_eval_episodes=10)
print(f'Mean Reward: {mean_reward} +/- {std_reward}')
Mean Reward: -59.8152412 +/- 12.237160118984027
# Render the video
record_video("BipedalWalker-v3", model, prefix="model_1")
Saving video to /content/videos/model_1-step-0-to-step-1000.mp4 Moviepy - Building video /content/videos/model_1-step-0-to-step-1000.mp4. Moviepy - Writing video /content/videos/model_1-step-0-to-step-1000.mp4
Moviepy - Done ! Moviepy - video ready /content/videos/model_1-step-0-to-step-1000.mp4
mp4_to_gif("/content/videos/model_1-step-0-to-step-1000.mp4", "model1.gif")
display_gif("/content/gif/model1.gif")
MoviePy - Building file gif/model1.gif with imageio.
We obtain an average reward of -59.81, with a standard deviation of 12.24, and the walker has a rather curious gait.
Let's modify the parameters and train again.
Model 2: Adjusting the number of environments¶
We will start with a larger number of environments to see whether it improves training speed and the final reward.
We double the number of environments and keep the remaining parameters constant.
Model 2.1: Increasing the environments
n_envs = 8
vec_env = make_vec_env("BipedalWalker-v3", n_envs=n_envs)
n_steps = 2048 // n_envs
model_2 = PPO("MlpPolicy", vec_env, n_steps=n_steps, tensorboard_log=log_dir)
model_2.learn(total_timesteps=100000, log_interval=10)
model_2.save(os.path.join(log_dir, "ppo_bipedalwalker2"))
# Evaluate
mean_reward_2, std_reward_2 = evaluate_policy(model_2, vec_env, n_eval_episodes=10)
print(f'Mean Reward: {mean_reward_2} +/- {std_reward_2}')
Mean Reward: -115.7177201 +/- 22.90682790412311
record_video("BipedalWalker-v3", model_2, prefix="model2")
Saving video to /content/videos/model2-step-0-to-step-1000.mp4 Moviepy - Building video /content/videos/model2-step-0-to-step-1000.mp4. Moviepy - Writing video /content/videos/model2-step-0-to-step-1000.mp4
Moviepy - Done ! Moviepy - video ready /content/videos/model2-step-0-to-step-1000.mp4
mp4_to_gif("./videos/model2-step-0-to-step-1000.mp4", "./model2.gif")
MoviePy - Building file gif/./model2.gif with imageio.
display_gif("./gif/model2.gif")
An unexpected result: the average reward decreased, so increasing the number of environments did not improve the model's performance. (Note that this run does not fix a seed, so part of the difference may be run-to-run variance.) Let's proceed with another parameter modification.
Model 3: Adjusting the neural network architecture¶
n_envs = 4
vec_env = make_vec_env("BipedalWalker-v3", n_envs=n_envs, seed=SEED)
n_steps = 2048 // n_envs
Let's define our own neural network architecture, starting with 2 layers of 64 neurons each.
from stable_baselines3.common.policies import ActorCriticPolicy
from torch import nn
import torch as th

class CustomPolicy(ActorCriticPolicy):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.net_arch = [
            {"pi": [64, 64], "vf": [64, 64]}
        ]
        # Policy network
        self.pi_net = nn.Sequential(
            nn.Linear(self.features_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 64),
            nn.ReLU()
        )
        # Value network
        self.vf_net = nn.Sequential(
            nn.Linear(self.features_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 64),
            nn.ReLU()
        )

    def forward(self, obs: th.Tensor, deterministic: bool = False):
        features = self.extract_features(obs)
        latent_pi = self.pi_net(features)
        latent_vf = self.vf_net(features)
        distribution = self._get_action_dist_from_latent(latent_pi)
        actions = distribution.get_actions(deterministic=deterministic)
        value = self.value_net(latent_vf)
        log_prob = distribution.log_prob(actions)
        return actions, value, log_prob

    def _predict(self, observation: th.Tensor, deterministic: bool = False):
        return self.forward(observation, deterministic=deterministic)[0]
model3 = PPO(CustomPolicy, vec_env, n_steps=n_steps, tensorboard_log=log_dir, seed=SEED)
model3.learn(total_timesteps=100000, log_interval=10)
model3.save(os.path.join(log_dir, "ppo_bipedalwalker3"))
# Evaluate
mean_reward_3, std_reward_3 = evaluate_policy(model3, vec_env, n_eval_episodes=10)
print(f'Mean Reward: {mean_reward_3} +/- {std_reward_3}')
Mean Reward: -100.1977372 +/- 0.11762179555320403
record_video("BipedalWalker-v3", model3, prefix="model3")
mp4_to_gif("./videos/model3-step-0-to-step-1000.mp4", "./model3.gif")
Saving video to /content/videos/model3-step-0-to-step-1000.mp4 Moviepy - Building video /content/videos/model3-step-0-to-step-1000.mp4. Moviepy - Writing video /content/videos/model3-step-0-to-step-1000.mp4
Moviepy - Done ! Moviepy - video ready /content/videos/model3-step-0-to-step-1000.mp4 MoviePy - Building file gif/./model3.gif with imageio.
display_gif("./gif/model3.gif")
Although the reward improved slightly over the previous model, it did not improve enough to justify keeping this architecture.
Let's proceed with another modification, this time adjusting the entropy coefficient (ent_coef).
Model 4: Increasing the entropy coefficient¶
The entropy coefficient acts as a regularizer. A policy has maximum entropy when all actions are equally probable, and minimum entropy when the probability of a single action dominates. The policy's entropy is scaled by this coefficient and subtracted from the loss, which helps prevent premature convergence to a dominant action probability that would suppress exploration.
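As a rough sketch of how this term enters the optimization (toy numbers, not values from any real training run; stable-baselines3's actual PPO loss also involves clipping):

```python
import math

# Per-dimension entropy of a diagonal Gaussian policy: H = 0.5 * ln(2*pi*e*sigma^2)
def gaussian_entropy(sigma):
    return 0.5 * math.log(2 * math.pi * math.e * sigma ** 2)

policy_loss, value_loss = 0.02, 0.5  # toy values for illustration only
ent_coef, vf_coef = 1e-4, 0.5
entropy = 4 * gaussian_entropy(1.0)  # 4 action dimensions, sigma = 1

# The entropy bonus is subtracted, so a more exploratory (higher-entropy)
# policy achieves a slightly lower loss.
loss = policy_loss + vf_coef * value_loss - ent_coef * entropy
print(f"entropy={entropy:.4f}  loss={loss:.6f}")
```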
n_envs = 4
vec_env = make_vec_env("BipedalWalker-v3", n_envs=n_envs, seed=SEED)
n_steps = 2048 // n_envs
model4 = PPO("MlpPolicy", vec_env, n_steps=n_steps, ent_coef=0.0001, tensorboard_log=log_dir)
model4.learn(total_timesteps=100000, log_interval=10)
model4.save(os.path.join(log_dir, "ppo_bipedalwalker4"))
# Evaluate
mean_reward_4, std_reward_4 = evaluate_policy(model4, vec_env, n_eval_episodes=10)
print(f'Mean Reward: {mean_reward_4} +/- {std_reward_4}')
Mean Reward: -104.2500812 +/- 0.18599696443963717
record_video("BipedalWalker-v3", model4, prefix="model4")
mp4_to_gif("./videos/model4-step-0-to-step-1000.mp4", "./model4.gif")
Saving video to /content/videos/model4-step-0-to-step-1000.mp4 Moviepy - Building video /content/videos/model4-step-0-to-step-1000.mp4. Moviepy - Writing video /content/videos/model4-step-0-to-step-1000.mp4
Moviepy - Done ! Moviepy - video ready /content/videos/model4-step-0-to-step-1000.mp4 MoviePy - Building file gif/./model4.gif with imageio.
display_gif("./gif/model4.gif")
Increasing the entropy coefficient made our model worse: pushing it to explore more was counterproductive.
Let's try increasing the learning rate instead:
Model 5: Increasing the learning rate¶
The learning rate determines the step size taken at each iteration while moving toward a minimum of the loss function. In essence, it controls how much the model changes in response to the estimated error each time its weights are updated.
By default, stable-baselines uses learning_rate = 0.0003. We increase it slightly to 0.0004 to observe the model's behavior:
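A toy illustration of what the step size means (this sketches plain gradient descent on a one-dimensional loss, not the Adam optimizer stable-baselines3 uses internally):

```python
# One gradient-descent step on the toy loss L(w) = w^2, whose gradient is 2w.
def sgd_step(w, grad, lr):
    return w - lr * grad

w = 1.0
grad = 2.0 * w
print(f"lr=0.0003 -> w={sgd_step(w, grad, 0.0003):.4f}")  # slightly smaller step
print(f"lr=0.0004 -> w={sgd_step(w, grad, 0.0004):.4f}")  # slightly larger step
```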
n_envs = 4
vec_env = make_vec_env("BipedalWalker-v3", n_envs=n_envs, seed=SEED)
n_steps = 2048 // n_envs
model5 = PPO("MlpPolicy", vec_env, n_steps=n_steps, learning_rate=0.0004, tensorboard_log=log_dir, seed=SEED)
model5.learn(total_timesteps=100000, log_interval=10)
model5.save(os.path.join(log_dir, "ppo_bipedalwalker5"))
# Evaluate
mean_reward_5, std_reward_5 = evaluate_policy(model5, vec_env, n_eval_episodes=10)
print(f'Mean Reward: {mean_reward_5} +/- {std_reward_5}')
Mean Reward: 148.36919389999997 +/- 5.604317288530378
The mean reward increased dramatically! Let's keep experimenting with other parameters.
Model 6: Increasing the batch size¶
The batch size is one of the most important hyperparameters in deep-learning training: it is the number of samples used in a single forward and backward pass through the network, and it has a direct impact on both the accuracy and the computational efficiency of training.
Let's increase the batch size from 64 to 128 and see whether the model gets better results:
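A quick back-of-the-envelope sketch of how batch_size interacts with the rollout buffer (based on how SB3's PPO splits each rollout of n_steps * n_envs transitions into minibatches for n_epochs passes, using the sizes from this notebook):

```python
# Each PPO rollout collects n_steps * n_envs transitions; that buffer is
# split into minibatches of batch_size for n_epochs optimization passes.
n_envs, n_epochs = 4, 10
n_steps = 2048 // n_envs
buffer_size = n_steps * n_envs  # 2048 transitions per rollout
for batch_size in (64, 128):
    updates = (buffer_size // batch_size) * n_epochs
    print(f"batch_size={batch_size}: {updates} gradient updates per rollout")
```

Doubling the batch size halves the number of gradient updates per rollout, which may partly explain a drop in performance at a fixed timestep budget.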
n_envs = 4
vec_env = make_vec_env("BipedalWalker-v3", n_envs=n_envs, seed=SEED)
n_steps = 2048 // n_envs
model6 = PPO(
policy='MlpPolicy',
env=vec_env,
n_steps=2048 // n_envs,
learning_rate=0.0004,
batch_size=128,
tensorboard_log=log_dir,
seed=SEED
)
model6.learn(total_timesteps=100000, log_interval=10)
model6.save(os.path.join(log_dir, "ppo_bipedalwalker6"))
# Evaluate
mean_reward_6, std_reward_6 = evaluate_policy(model6, vec_env, n_eval_episodes=10)
print(f'Mean Reward: {mean_reward_6} +/- {std_reward_6}')
Mean Reward: 71.2720534 +/- 2.8262695998822256
It seems that increasing the batch size gives us worse performance.
Model 7: The gamma parameter¶
Let's start experimenting with the other parameters exposed by the stable-baselines3 library.
n_envs = 4
vec_env = make_vec_env("BipedalWalker-v3", n_envs=n_envs, seed=SEED)
We increase gamma slightly from its default of 0.99 to 0.999.
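Gamma is the discount factor: future rewards are weighted by gamma^t, so a common rule of thumb (a rough sketch, not an exact bound) is that rewards beyond roughly 1/(1 - gamma) steps contribute little to the return:

```python
# Rough "effective horizon" of the discount factor: rewards more than
# about 1/(1 - gamma) steps ahead are heavily discounted.
for gamma in (0.9, 0.99, 0.999):
    horizon = 1 / (1 - gamma)
    print(f"gamma={gamma}: effective horizon of roughly {horizon:.0f} steps")
```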
model7 = PPO(
policy='MlpPolicy',
env=vec_env,
n_steps=2048 // n_envs,
gamma=0.999,
learning_rate=0.0004,
tensorboard_log=log_dir,
seed=SEED
)
model7.learn(total_timesteps=100000, log_interval=10)
model7.save(os.path.join(log_dir, "ppo_bipedalwalker7"))
# Evaluate
mean_reward_7, std_reward_7 = evaluate_policy(model7, vec_env, n_eval_episodes=10)
print(f'Mean Reward: {mean_reward_7} +/- {std_reward_7}')
Mean Reward: 86.69985259999999 +/- 86.12514877214429
The reward dropped compared to our best model.
Model 7.5: Decreasing gamma¶
model7_5 = PPO(
policy='MlpPolicy',
env=vec_env,
n_steps=2048 // n_envs,
gamma=0.9,
learning_rate=0.0004,
tensorboard_log=log_dir,
seed=SEED
)
model7_5.learn(total_timesteps=100000, log_interval=10)
model7_5.save(os.path.join(log_dir, "ppo_bipedalwalker7_5"))
# Evaluate
mean_reward_7_5, std_reward_7_5 = evaluate_policy(model7_5, vec_env, n_eval_episodes=10)
print(f'Mean Reward: {mean_reward_7_5} +/- {std_reward_7_5}')
Mean Reward: -91.7253053 +/- 10.889749073977962
Decreasing gamma was definitely not the right move here.
Model 8: Increasing the number of batches¶
n_envs = 4
vec_env = make_vec_env("BipedalWalker-v3", n_envs=n_envs, seed=SEED)
model8 = PPO(
policy='MlpPolicy',
env=vec_env,
n_steps=2048 // n_envs,
learning_rate=0.0004,
tensorboard_log=log_dir,
batch_size=128,
seed=SEED,
)
model8.learn(total_timesteps=100000, log_interval=10)
model8.save(os.path.join(log_dir, "ppo_bipedalwalker8"))
# Evaluate
mean_reward_8, std_reward_8 = evaluate_policy(model8, vec_env, n_eval_episodes=10)
print(f'Mean Reward: {mean_reward_8} +/- {std_reward_8}')
Mean Reward: 71.2720534 +/- 2.8262695998822256
We still haven't beaten model 5.
Model 9: Increasing the number of epochs¶
n_envs = 4
vec_env = make_vec_env("BipedalWalker-v3", n_envs=n_envs, seed=SEED)
n_steps = 2048 // n_envs
model9 = PPO(
policy='MlpPolicy',
env=vec_env,
n_steps=2048 // n_envs,
learning_rate=0.0004,
tensorboard_log=log_dir,
seed=SEED,
n_epochs=20
)
model9.learn(total_timesteps=100000, log_interval=10)
model9.save(os.path.join(log_dir, "ppo_bipedalwalker_9"))
mean_reward_9, std_reward_9 = evaluate_policy(model9, vec_env, n_eval_episodes=10)
print(f'Mean Reward: {mean_reward_9} +/- {std_reward_9}')
Mean Reward: 138.27054779999997 +/- 7.720433594014249
The model doesn't seem to improve much with more epochs, and they increase training time, so we can discard this change.
Let's take model 5 and increase the number of timesteps instead.
Model 10: 200,000 timesteps¶
n_envs = 4
vec_env = make_vec_env("BipedalWalker-v3", n_envs=n_envs, seed=SEED)
model10 = PPO(
policy='MlpPolicy',
env=vec_env,
n_steps=2048 // n_envs,
learning_rate=0.0004,
tensorboard_log=log_dir,
seed=SEED
)
model10.learn(total_timesteps=200000, log_interval=10)
model10.save(os.path.join(log_dir, "ppo_bipedalwalker10"))
mean_reward_10, std_reward_10 = evaluate_policy(model10, vec_env, n_eval_episodes=10)
print(f'Mean Reward: {mean_reward_10} +/- {std_reward_10}')
Mean Reward: 207.04815539999998 +/- 9.881409134642428
Increasing the number of timesteps clearly improves the mean reward. What would happen if we increased it even further?
Model 11: 1,000,000 timesteps¶
n_envs = 4
vec_env = make_vec_env("BipedalWalker-v3", n_envs=n_envs, seed=SEED)
model11 = PPO(
policy='MlpPolicy',
env=vec_env,
n_steps=2048 // n_envs,
learning_rate=0.0004,
tensorboard_log=log_dir,
seed=SEED
)
model11.learn(total_timesteps=1_000_000, log_interval=10)
model11.save(os.path.join(log_dir, "ppo_bipedalwalker11"))
mean_reward_11, std_reward_11 = evaluate_policy(model11, vec_env, n_eval_episodes=10)
print(f'Mean Reward: {mean_reward_11} +/- {std_reward_11}')
Mean Reward: 180.32631859999998 +/- 132.25301662354542
Despite the extra timesteps, the model became more unstable, with a lower mean and a much higher standard deviation.
Let's implement a dynamic learning rate, one that decreases over time. This can help the model converge more stably over longer training runs.
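Stable-Baselines3 accepts a callable for learning_rate; it is called with the remaining training progress, a value that decays linearly from 1.0 at the start of training to 0.0 at the end, so a simple lambda yields a linear decay:

```python
# SB3 schedules receive progress_remaining, which goes from 1.0 (start)
# to 0.0 (end of training), so this lambda decays the rate linearly.
schedule = lambda progress_remaining: 3e-4 * progress_remaining

for progress in (1.0, 0.5, 0.0):
    print(f"progress_remaining={progress}: lr={schedule(progress)}")
```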
Final model: Dynamic learning rate¶
# Create the environment
n_envs = 4
vec_env = make_vec_env("BipedalWalker-v3", n_envs=n_envs, seed=SEED)
model12 = PPO(
policy='MlpPolicy',
env=vec_env,
n_steps=2048 // n_envs,
learning_rate=lambda f: 3e-4 * f,  # dynamic learning rate (linear decay)
tensorboard_log=log_dir,
seed=SEED
)
model12.learn(total_timesteps=1_000_000, log_interval=10)
model12.save(os.path.join(log_dir, "ppo_bipedalwalker12"))
mean_reward_12, std_reward_12 = evaluate_policy(model12, vec_env, n_eval_episodes=10)
print(f'Mean Reward: {mean_reward_12} +/- {std_reward_12}')
Mean Reward: 293.24582110000006 +/- 0.46386835227496104
We finally achieved a mean reward of 293 points, with a standard deviation of 0.46.
Let's watch how our friend behaves:
record_video("BipedalWalker-v3", model12, prefix="model_final")
Saving video to /content/videos/model_final-step-0-to-step-1000.mp4 Moviepy - Building video /content/videos/model_final-step-0-to-step-1000.mp4. Moviepy - Writing video /content/videos/model_final-step-0-to-step-1000.mp4
Moviepy - Done ! Moviepy - video ready /content/videos/model_final-step-0-to-step-1000.mp4
mp4_to_gif("/content/videos/model_final-step-0-to-step-1000.mp4", "model_final.gif")
display_gif("/content/gif/model_final.gif")
MoviePy - Building file gif/model_final.gif with imageio.
Hardcore mode¶

This environment offers a much harder mode in which the terrain is littered with stairs, uneven ground, pits, and so on. This difficulty level can be enabled by passing the option hardcore=True when creating the environment.
Model 1: Previous model¶
# Create the environment
n_envs = 4
env_kwargs = {"hardcore": True}
vec_env_hardcore = make_vec_env("BipedalWalker-v3", n_envs=n_envs, seed=SEED, env_kwargs=env_kwargs)
model_final_hardcore = PPO(
policy="MlpPolicy",
env=vec_env_hardcore,
n_steps=2048,
batch_size=32,
gae_lambda=0.95,
gamma=0.99,
n_epochs=10,
ent_coef=0.0,
learning_rate=lambda f: 3e-4 * f,
clip_range=0.2,
vf_coef=0.5,
max_grad_norm=0.5,
verbose=1,
tensorboard_log="./runs/",
seed=SEED
)
Using cpu device
# Train the model
model_final_hardcore.learn(total_timesteps=1_000_000, log_interval=10)
Logging to ./runs/PPO_1. Training log (episode statistics every 10 iterations):

total_timesteps  ep_len_mean  ep_rew_mean
81920            591          -108
163840           944          -109
245760           1150         -105
327680           1340         -96.3
409600           1420         -84.5
491520           1460         -73.1
573440           1470         -63.6
655360           1450         -56.3
737280           1410         -57
819200           1340         -56.3
901120           1340         -52.2
983040           1320         -46.2
<stable_baselines3.ppo.ppo.PPO at 0x7e369d33ef20>
# Evaluate
mean_reward, std_reward = evaluate_policy(model_final_hardcore, vec_env_hardcore, n_eval_episodes=10)
print(f'Mean Reward: {mean_reward} +/- {std_reward}')
Mean Reward: -52.5798312 +/- 42.365340695126456
model_final_hardcore.save(os.path.join(log_dir, "ppo_bipedalwalker_hardcore"))
Model 2: Increasing the number of environments¶
# Create the environment
n_envs = 32
vec_env_hardcore_2 = make_vec_env("BipedalWalker-v3", n_envs=n_envs, seed=SEED, env_kwargs=env_kwargs)
model_final_hardcore_2 = PPO(
policy="MlpPolicy",
env=vec_env_hardcore_2,
n_steps=2048,
batch_size=64,
gae_lambda=0.95,
gamma=0.999,
n_epochs=10,
ent_coef=0.0,
learning_rate=lambda f: 3e-4 * f,
clip_range=0.18,
vf_coef=0.5,
max_grad_norm=0.5,
verbose=1,
tensorboard_log="./runs/",
seed=SEED
)
Using cpu device
# Train the model
model_final_hardcore_2.learn(total_timesteps=1_000_000, log_interval=10)
Logging to ./runs/PPO_3. Training log (episode statistics at iteration 10):

total_timesteps  ep_len_mean  ep_rew_mean
655360           1060         -110
<stable_baselines3.ppo.ppo.PPO at 0x7e3696611d20>
# Evaluate
mean_reward, std_reward = evaluate_policy(model_final_hardcore_2, vec_env_hardcore, n_eval_episodes=10)
print(f'Mean Reward: {mean_reward} +/- {std_reward}')
Mean Reward: -41.531564499999995 +/- 38.6854997181889
# Create the environment
n_envs = 4
env_kwargs = {"hardcore": True}
vec_env_hardcore = make_vec_env("BipedalWalker-v3", n_envs=n_envs, seed=SEED, env_kwargs=env_kwargs)
Model 3: A larger number of timesteps¶
model_final_hardcore_3 = PPO(
policy="MlpPolicy",
env=vec_env_hardcore_2,
n_steps=2048,
batch_size=64,
gae_lambda=0.95,
gamma=0.999,
n_epochs=10,
ent_coef=0.0,
learning_rate=lambda f: 3e-4 * f,
clip_range=0.18,
vf_coef=0.5,
max_grad_norm=0.5,
verbose=1,
tensorboard_log="./runs/",
seed=SEED
)
Using cpu device
# Train the model
model_final_hardcore_3.learn(total_timesteps=5_000_000, log_interval=10)
Logging to ./runs/PPO_4. Training log (episode statistics every 10 iterations):

total_timesteps  ep_len_mean  ep_rew_mean
655360           1110         -114
1310720          1250         -102
1966080          1380         -96.5
2621440          1350         -92.5
3276800          1280         -77.8
3932160          1230         -76.5
4587520          1380         -67.1
<stable_baselines3.ppo.ppo.PPO at 0x7e3697a952a0>
# Evaluate
mean_reward, std_reward = evaluate_policy(model_final_hardcore_3, vec_env_hardcore, n_eval_episodes=10)
print(f'Mean Reward: {mean_reward} +/- {std_reward}')